Skip to content

feat(sparserl-sync): integrate SparseRL-Sync sparse diff tracking into optimizer#5

Open
tonic-scitix wants to merge 1 commit into
scitix:sirl-devfrom
tonic-scitix:feat/sparserl-sync
Open

feat(sparserl-sync): integrate SparseRL-Sync sparse diff tracking into optimizer#5
tonic-scitix wants to merge 1 commit into
scitix:sirl-devfrom
tonic-scitix:feat/sparserl-sync

Conversation

@tonic-scitix
Copy link
Copy Markdown

@tonic-scitix tonic-scitix commented May 15, 2026

feat(sparserl-sync): integrate SparseRL-Sync sparse diff tracking into optimizer

What does this PR do?

Instrument the Megatron optimizer's in-place weight-copy sites so the
SparseRL-Sync package can compute per-param sparse-update indices for each
rollout. When sparse_update is not installed the imports fall back to
nullcontext no-ops and the upstream behaviour is fully preserved.

This PR is one of three coordinated changes:

Modifications

megatron/core/optimizer/distrib_optimizer.py

  • Best-effort from sparse_update import init_sparse_manager, sparse_diff_context; falls back to nullcontext / no-op when not installed.
  • _build_model_and_main_param_groups: calls init_sparse_manager(model_param, shard_model_weight, shard_main_weight, param_range) for both fp16/bf16 and fp32 param groups so the SparseManager knows which DP-local buffer offset maps to which model param.
  • _copy_main_params_to_model_params: wraps shard_model_param.data.copy_(shard_main_param) with sparse_diff_context(shard_model_param, shard_main_param) so the manager snapshots pre/post state and emits per-param sparse indices.

megatron/core/optimizer/cpu_offloading/hybrid_optimizer.py

  • Best-effort from sparse_update import sparse_diff_context.
  • Wraps both copy-back hooks (param_copy_back_gpu_hook H2D copy and fp32_param_copy_back_gpu_hook fp32→fp16 copy) with sparse_diff_context(target, source).

megatron/core/pipeline_parallel/p2p_communication.py

  • Passes the group positional argument to all four torch.distributed.P2POp(...) constructors. It was previously omitted, which causes deadlocks when SparseRL-Sync's NCCL group overlaps with the world group during sparse weight-sync.

Pre-checks

  • I want this PR in a versioned release and have added the appropriate Milestone
  • I have added relevant unit tests
  • I have added relevant functional tests
  • I have added proper typing to my code
  • I have added relevant documentation
  • I have run the autoformatter.sh on my PR

Notes for reviewers

  • All three import blocks are try/except ImportError; removing the
    sparse_update package from the environment restores byte-identical
    upstream behaviour with zero code changes.
  • init_sparse_manager is a one-time bind per shard slice; it does not
    allocate any GPU memory itself.
  • sparse_diff_context is a lightweight CUDA-stream-safe context manager; it
    does not add synchronisation points beyond the existing stream ordering.
  • The p2p_communication.py fix is independent of SparseRL-Sync and is safe
    to cherry-pick separately.

…o optimizer

Wrap the in-place param.copy_() calls in DistributedOptimizer and
HybridDeviceOptimizer with sparse_diff_context() so the attached
SparseManager can snapshot pre/post state and build per-param sparse-update
indices for the Trainer→Rollout weight broadcast.

Also pass the communication group to P2POp constructors in p2p_communication.py
(group was omitted; required for non-default PG usage).

Co-Authored-By: Claude <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant